pacman::p_load(ggplot2, ggiraph, plotly,
patchwork, DT, tidyverse,
ggrepel, ggthemes, hrbrthemes,
tidyverse,ggstatsplot,pals) Take-home_Ex01
1. The task
The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.
2. Data preparation
2.1 Installing the data packages
The unzipped files have been saved into a new folder named data for better organization.
part_info <- read_csv("data/Participants.csv")Participants.csv
Contains information about the residents of City of Engagement that have agreed to participate in this study.
- participantId (integer): unique ID assigned to each participant.
- householdSize (integer): the number of people in the participant’s household
- haveKids (boolean): whether there are children living in the participant’s household.
- age (integer): participant’s age in years at the start of the study.
- educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
- interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
- joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")FinancialJournal.csv
Contains information about financial transactions.
- participantId (integer): unique ID corresponding to the participant affected
- timestamp (datetime): the time when the check-in was logged
- category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
- amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link
Lets first examine the properties of the participants csv file.
datatable(part_info)str(part_info)spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
$ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
$ haveKids : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
$ age : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
$ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
$ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
$ joviality : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. householdSize = col_double(),
.. haveKids = col_logical(),
.. age = col_double(),
.. educationLevel = col_character(),
.. interestGroup = col_character(),
.. joviality = col_double()
.. )
- attr(*, "problems")=<externalptr>
Only displaying the meaningful summary
summary(part_info[, c("householdSize", "haveKids", "age", "educationLevel", "interestGroup", "joviality")]) householdSize haveKids age educationLevel
Min. :1.000 Mode :logical Min. :18.00 Length:1011
1st Qu.:1.000 FALSE:710 1st Qu.:29.00 Class :character
Median :2.000 TRUE :301 Median :39.00 Mode :character
Mean :1.964 Mean :39.07
3rd Qu.:3.000 3rd Qu.:50.00
Max. :3.000 Max. :60.00
interestGroup joviality
Length:1011 Min. :0.000204
Class :character 1st Qu.:0.240074
Mode :character Median :0.477539
Mean :0.493794
3rd Qu.:0.746819
Max. :0.999234
Count of resident for each interest group
table(part_info$interestGroup)
A B C D E F G H I J
102 91 102 96 83 106 108 111 96 116
Count of resident for each education level
table(part_info$educationLevel)
Bachelors Graduate HighSchoolOrCollege Low
232 170 525 84
sum(is.na(part_info))[1] 0
Next we will visualize some variables in document to see if there are any anomalies. From the chart below, we see that there exist no participants with extremely large age or low
Show the code
p1 <- ggplot(part_info, aes(x = age)) +
geom_bar() +
labs(title = "Distribution of Participants' Age",
x = "Age",
y = "No. of person")
fig <- ggplotly(p1)
fig <- fig %>%
layout(xaxis = list(title = 'Age'),
yaxis = list(title = 'No. of person'))
p2 <- ggplot(part_info, aes(x = householdSize)) +
geom_bar() +
labs(title = "Distribution of house hold size",
x = "Household size",
y = "Count")
fig2 <- ggplotly(p2)
fig2 <- fig2 %>%
layout(xaxis = list(title = 'Household size'),
yaxis = list(title = 'Count'))
fig3 <- subplot(fig, fig2, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
layout(title = 'Outlier checking',
plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff')) %>%
layout(annotations = list(
list(
x = 0.2,
y = 1.0,
text = "Distribution of Participants' Age",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.8,
y = 1.0,
text = "Distribution of house hold size",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
)
))
fig3duplicates1 <- duplicated(part_info)
part_info[duplicates1, ]# A tibble: 0 × 7
# ℹ 7 variables: participantId <dbl>, householdSize <dbl>, haveKids <lgl>,
# age <dbl>, educationLevel <chr>, interestGroup <chr>, joviality <dbl>
Since 0 rows are displayed. We can confirm that duplicates do not exist in this csv file.
Lets move on to examine the properties of the finance csv file.
Sneak peak of the first few entries in the dataset
head(finance)# A tibble: 6 × 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
Please do not use datatable here or you will face a error of Fatal javascript OOM in reached Heap Limit which indicate that R studio session has run out of memory while attempting to execute JavaScript code.
str(finance)spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
$ timestamp : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
$ category : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
$ amount : num [1:1513636] 2473 -555 -38 2047 -555 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. timestamp = col_datetime(format = ""),
.. category = col_character(),
.. amount = col_double()
.. )
- attr(*, "problems")=<externalptr>
Only displaying the meaningful summary
summary(finance[c("amount")]) amount
Min. :-1562.726
1st Qu.: -5.594
Median : -4.000
Mean : 20.047
3rd Qu.: 21.598
Max. : 4096.526
table(finance$category)
Education Food Recreation RentAdjustment Shelter
3319 790051 296013 131 11463
Wage
412659
sum(is.na(finance))[1] 0
Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.
Show the code
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
geom_boxplot() +
xlab("Expense Category") +
ylab("Amount") +
ggtitle("Amount Spent by Expense Category"))